Abstract

In this work, we introduce a deep-structured conditional random field (DS-CRF) model 
for the purpose of state-based object silhouette tracking. The proposed DS-CRF model 
consists of a series of state layers, where each state layer spatially characterizes the 
object silhouette at a particular point in time. The interactions between adjacent state 
layers are established by inter-layer connectivity dynamically determined based on 
inter-frame optical flow. By incorporating both spatial and temporal context in a
dynamic fashion within such a deep-structured probabilistic graphical model, the 
proposed DS-CRF model allows us to develop a framework that can accurately and 
efficiently track object silhouettes that can change greatly over time, as well as under 
different situations such as occlusion and multiple targets within the scene. Experimental
results using video surveillance datasets containing different scenarios such as occlusion 
and multiple targets showed that the proposed DS-CRF approach provides strong 
object silhouette tracking performance when compared to baseline methods such as 
mean-shift tracking, as well as state-of-the-art methods such as context tracking and 
boosted particle filtering. 


Introduction 

Structured prediction, where one wishes to predict structured states given structured 
observations, is an interesting and challenging problem that is important in a number of
different applications, with one of them being object silhouette tracking. The goal of 
object silhouette tracking is to identify the silhouette of the same object over a video 
sequence, and is very challenging due to a number of factors such as occlusion, object 
motion changing dynamically over a video sequence, and object silhouette changing 
drastically over time. 

Much of the early literature in object tracking has consisted of generative tracking
methods, where the joint distribution of states and observations is modeled. The
classical example is the use of Kalman filters, where predictions of the object are
made with Gaussian assumptions on both states and observations based on
predefined linear system dynamics. However, since object motion does not follow Gaussian
behaviour and has non-linear system dynamics, the use of Kalman filters can often lead
to poor prediction performance for object tracking. To address the issue of non-linear
system dynamics, researchers have made use of modified Kalman filters such as the
extended Kalman filter and unscented Kalman filter, but these do not resolve
issues associated with non-Gaussian behaviour of object motion. To address both the
issue of non-linear system dynamics and non-Gaussian behaviour, a lot of attention has
been paid to the use of particle filters, which are non-parametric posterior density
estimation methods that can model arbitrary statistical distributions. However, the use
of particle filters for object tracking is not only computationally expensive, but also difficult
to learn, especially for the case of object silhouette tracking where motion and silhouette
appearance can change drastically and dynamically over time.

Recently, there has been significant interest in the use of discriminative methods for
object tracking over the use of generative methods. In contrast to generative methods,
discriminative methods directly model the conditional probability distribution of states
given observations, and relax the conditional independence assumption made by
generative methods. In particular, conditional random fields (CRFs) are the most
well-known discriminative graphical models used for the purpose of structured
prediction, and have been shown in a large number of studies to outperform generative
models such as Markov random fields. Motivated by this, a number of CRF-based
methods have been proposed for the purpose of object tracking. Taycher et al.
proposed a human tracking approach using CRFs, with an L1 similarity space
corresponding to the potential functions. Different poses were considered as tracked
states within a video sequence, where the number of states must be predefined by the
user. Sigal et al. used two-layer spatio-temporal models for component-based
detection and object tracking in video sequences. In that work, each object or
component of an object was considered as a node in the graphical model at any given
time, and the graph edges correspond to learned spatial and temporal constraints.
Following this work, Ablavsky et al. [10] proposed a layered graphical model for the
purpose of partially-occluded object tracking. A layered image plane was used to
represent motion surrounding a known object that is associated with a pre-computed
graphical model. CRFs have also been applied to image-sequence segmentation [11, 12],
where the random fields are modeled using spatial and temporal dependencies.

Shafiee et al. [13] proposed the concept of temporal conditional random fields (TCRF)
for the purpose of object tracking, where the object's next position is estimated based
on the current video frame, and then subsequently refined via template matching based
on a subsequent video frame.

The use of CRFs specifically for object silhouette tracking is more recent and
as such more limited in existing literature. Ren and Malik [14] proposed the use of
CRFs for object silhouette segmentation in video sequences where the background and
foreground distributions are updated over time. In the work by Boudoukh et al. [15], the
target silhouette is tracked over a video sequence by fusing different visual cues through
the use of a CRF. In particular, temporal color similarity, spatial color continuity, and
spatial motion continuity were considered as the CRF feature functions. The key
advantage of this method for object silhouette tracking is that pixel-wise resolution can
be achieved.

While such CRF-based approaches to object silhouette tracking show significant
promise, one inherent limitation is that the existing CRF models used to
predict the object silhouettes for one video frame are limited in their ability to take
greater advantage of information from other video frames. One can use more complex
CRF models to increase modeling power to address these limitations for improved object
silhouette tracking, but this would also significantly increase computational complexity as
well as model learning complexity. Recently, the concept of deep-structured models has
been proposed to allow for increased modeling power without the significant increase
in computational complexity and model learning complexity incurred by complex CRF
models. Deep-structured CRF models make use of intermediate state layers to improve
structured prediction performance, where each layer depends on its previous layer.
Ratajczak et al. [16] proposed a context-specific deep
CRF model where the local factors in linear-chain CRFs are replaced with sum-product
networks. Yu et al. [17, 18] proposed a deep-structured CRF model composed of
multiple layers of simple CRFs, with each layer's input consisting of the previous layer's
input and the resulting marginal probabilities. Given that the problem of object
silhouette tracking is one where a set of video frames can contribute to predicting the
object silhouette in a new video frame, one is motivated to investigate the efficacy of
deep-structured CRF models for solving this problem.

In this work, we propose an alternative framework for state-based object silhouette 
tracking based on the concept of deep-structured discriminative modeling. In particular, 
we introduce a deep-structured conditional random field (DS-CRF) model consisting of 
a series of state layers, where each state layer spatially characterizes the object silhouette
at a particular point in time. The interactions between adjacent state layers are
established by inter-layer connectivity dynamically determined based on inter-frame
optical flow. By incorporating both spatial and temporal context in a dynamic fashion
within such a deep-structured probabilistic graphical model, the proposed DS-CRF 
model allows us to develop a framework that can accurately and efficiently track object 
silhouettes that can change greatly over time. Furthermore, such a modeling framework 
does not require distinct stages for prediction and update, and does not require 
independent training for the dynamics of each object silhouette being tracked. 
Experimental results show that the proposed framework can estimate object silhouettes 
over time in situations where there is occlusion as well as large changes in object 
silhouette appearance over time. 


Materials and Methods 

Within a statistical modeling framework, one can describe the problem of object
silhouette tracking as a classification problem, where the goal is to classify each pixel in
a video frame as either foreground (part of the object silhouette) or background. The
goal is to maximize the posterior probability of the states given observations, $P(Y \mid M)$,
where $Y$ is the state plane characterizing the object silhouette and $M$ is the corresponding
set of observations (e.g., video). Discriminative models, such as CRFs, derive the posterior
probability $P(Y \mid M)$ directly and as such do not require the independence assumptions
necessary for generative modeling approaches. In the proposed DS-CRF modeling
framework for object silhouette tracking, the object silhouette and corresponding
background at the pixel level for each video frame are characterized by a state layer,
with the series of state layers interconnected based on inter-frame optical flow
information to form a deep-structured conditional random field model that facilitates
interactions amongst adjacent state layers. A detailed description of CRFs in the
context of object silhouette tracking, followed by a detailed description of the proposed
DS-CRF model, is provided below.


Conditional Random Fields 


Conditional random fields (CRFs) are amongst the most effective and widely used
discriminative modeling tools developed in the past two decades. The idea of CRF
modeling was first proposed by Lafferty et al. [19]; based on the Markov property, the
CRF directly models the conditional probability of the states given the measurements,
without requiring the specification of any sort of underlying prior model, and relaxes
the conditional independence assumption commonly used by generative models.

Formally, let $G = (V, E)$ be an undirected graph such that $y_i \in Y$ is indexed by the
vertices $v_i \in V$ of $G$. $(Y, M)$ is said to be a CRF if, when globally conditioned on $M$,
the random variables $y_i$ obey the Markov property with respect to the graph $G$. In
other words,

$$P(y_i \mid M, y_{V - \{i\}}) = P(y_i \mid M, y_{N_i}),$$

where $V - \{i\}$ is the set of all nodes in $G$ except node $i$, $Y$ is a set of output
variables that we aim to predict, and $N_i$ and $M$ are the sets of neighbors of $i$ and of
observed input variables, respectively. The general form of a CRF is given by

$$P(Y \mid M) = \frac{1}{Z(M)} \prod_{c \in C} \psi_c(y_c, M), \qquad
Z(M) = \sum_{Y} \prod_{c \in C} \psi_c(y_c, M), \tag{1}$$

where $Z(M)$ is a normalization constant, essentially the so-called partition function of
Gibbs fields, with respect to all possible values of $Y$, $C$ represents the set of all cliques,
and $\psi_c$ encodes potential functions with a non-negative value condition.
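
To make the role of the partition function concrete, the following is a minimal sketch of Eq. (1) on a toy three-node chain, where the posterior is computed by explicit enumeration of all labelings. The potentials `unary` and `pairwise` and the observation values in `M` are illustrative assumptions, not quantities taken from this work, and enumeration is only feasible for very small graphs.

```python
from itertools import product

# Hypothetical observations for a 3-node chain (values in [0, 1]).
M = [0.9, 0.8, 0.2]

def unary(y, m):
    # Non-negative potential: favours label 1 when the observation is large.
    return m if y == 1 else 1.0 - m

def pairwise(y_a, y_b):
    # Non-negative potential: favours equal labels on neighbouring nodes.
    return 2.0 if y_a == y_b else 1.0

def unnormalized(Y, M):
    # Product of the clique potentials psi_c(y_c, M) over all cliques.
    score = 1.0
    for y, m in zip(Y, M):
        score *= unary(y, m)
    for a in range(len(Y) - 1):
        score *= pairwise(Y[a], Y[a + 1])
    return score

def posterior(M):
    # Z(M) sums the unnormalized score over every possible labeling Y.
    labelings = list(product([0, 1], repeat=len(M)))
    Z = sum(unnormalized(Y, M) for Y in labelings)
    return {Y: unnormalized(Y, M) / Z for Y in labelings}

P = posterior(M)
best = max(P, key=P.get)  # labeling with the highest posterior probability
```

The pairwise potential pulls neighbouring labels together, while the unary potential ties each label to its own observation; the normalization by `Z` is what turns the product of potentials into a proper distribution.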

According to the non-negativity constraint on $\psi_c$, and based on the Principle of
Maximum Entropy [20], a proper probability distribution is the one that maximizes the
entropy, given the constraints from the training set [21]. As such, a new form of the
CRF is then given by

$$P(Y \mid M) = \frac{1}{Z(M)} \prod_{c \in C}
\exp\Big( \sum_{k} \lambda_{k,c}\, f_{k,c}(y_c, M) \Big), \tag{2}$$

where $f_{k,c}(y_c, M)$ is a feature function with respect to clique $c$, and $\lambda_{k,c}$
denotes the weight of each feature function to be learned. The feature function expresses the
relationship amongst the random variables in a clique. The index $k$ runs over the
feature functions defined with respect to each clique.

Two-dimensional CRFs have been applied to many computer vision problems, such
as segmentation and classification. In particular, because of the undirected structure of
most images, the 2D CRF leads to efficient performance in computer vision [7].
Although early CRFs incorporate spatial relationships (spatial feature functions)
amongst random variables into the model, in many applications such as visual tracking
these relationships repeat sequentially over time, and incorporating this property into the
framework can lead to better modeling.

Feature functions play an important role in the context of CRF modeling. Selecting 
appropriate feature functions speeds up the convergence of the CRF training process, 
whereas inappropriate feature functions can cause inconsistent results in CRF inference. 
To illustrate the importance of selecting appropriate feature functions for object 
silhouette tracking, we train a CRF for predicting the object silhouette at one frame 
based on the previous frame using only spatial feature functions without incorporating 
any feature function describing temporal relationships amongst frames. The two frames
contain a simulated object that moves by a small amount between them. As seen in
Fig. 1, the prediction result of the object silhouette is poor as the CRF could not learn
object motion dynamics in the absence of temporal feature functions, leading to poor
object silhouette tracking performance.

Figure 1. Example of CRF modeling of object motion using only spatial
feature functions for object silhouette tracking. The first column (A) shows the
temporal observation and the second column (B) shows the label used for training (two
columns are two consecutive frames at t − 1 and t). The third column (C) shows the
prediction result of the object silhouette. As seen, the prediction result of the object
silhouette is poor since the CRF could not learn object motion dynamics in the absence
of temporal feature functions, leading to poor object silhouette tracking performance.

To tackle this issue of selecting appropriate feature functions to improve tracking
performance, Shafiee et al. [13] proposed the incorporation of temporal feature functions
such as inter-frame optical flow into the CRF modeling framework to better take
advantage of temporal relationships for visual tracking. Although this approach showed
promising results and illustrated the feasibility of temporal processing for visual
tracking in the CRF modeling framework, it only makes use of motion information from
the previous frame to estimate object position in the current frame and as such cannot
handle large changes in motion dynamics or shape over time, nor can it handle
accelerated motion dynamics. Furthermore, it is designed for object position tracking
and does not handle object silhouette tracking. Therefore, motivated by the benefits of
incorporating both spatial and temporal context in a dynamic fashion in a manner that
addresses the aforementioned issues, we propose a deep-structured CRF (DS-CRF)
model for object silhouette tracking, where the series of interconnected state layers
making up the model along with the set of corresponding temporal observations allow
for better modeling of more complex motion and shape dynamics that can occur in
realistic scenarios.


Deep-structured Conditional Random Fields 

Here, we will describe the proposed DS-CRF model in detail as follows. First, the graph 
representation of the DS-CRF model is presented. Second, the manner in which 
inter-layer connectivity within the DS-CRF model is established dynamically based on 
motion information derived from temporal observations is presented. Third, a set of new 
feature functions incorporated in the DS-CRF model for object silhouette tracking is 
presented. 

Fig. 2 demonstrates the flow diagram of the proposed framework. Several features
such as optical flow are extracted from observed frames to track the new target location
in the video. The tracking result is encoded as a black and white field demonstrating
the target location with black pixels.

Figure 2. The flow diagram of the proposed framework. The observed frames
are utilized to predict the new target position. Several feature functions are
incorporated into the model. The extracted features from the observed frames are
utilized as the measurement layer while the new object location inferred in a black and
white plane is considered as the label field (tracking result) in the random field.

Graph Representation

Let the graph $G(V, E)$ represent the proposed DS-CRF model, which consists of several
state layers $Y_t : Y_1$ corresponding to times $t : 1$, as shown in Fig. 3. Each state layer
characterizes the object silhouette at a specific time step by modeling the conditional
probability of $Y_t$ given the previous states of the object in times $t-1 : 1$ and their
corresponding observations:

$$P(Y_t \mid M_{t:1}, Y_{t-1:1}) = \frac{1}{Z(M_{t:1})} \prod_{c \in C}
\exp\Big( \sum_{k} \lambda_{k,c}\, f_{k,c}(y_{c,t}, M_{t:1}, Y_{t-1:1}) \Big), \tag{3}$$

where $Z(M_{t:1})$ is a normalization constant, $C$ is the set of inter-layer and intra-layer
cliques, $\lambda_{k,c}$ determines the weight of each feature function, and $f_{k,c}(\cdot)$ denotes the
feature function over clique $c$. The intra-layer connectivity between nodes in each layer
(i.e., $e_T$ in layer $Y_t$, Fig. 3) imposes the smoothness property of the target object into
the model, while the inter-layer connectivity between two adjacent state layers (i.e., the
connection between layers $Y_t$ and $Y_{t-1}$ corresponding to node $y_k$) incorporates object motion
dynamics into the model. As such, the inter-layer connectivity carries the energy
corresponding to the unary potential in the model, and is specified dynamically and
adaptively in the proposed framework based on motion information derived from
temporal observations, which will be described in detail in the next section.



Figure 3. Graph representation of the deep-structured conditional random
field model. In this model, the labels $Y_{t-1} : Y_1$ and all observations can be
incorporated to model the object silhouette's motion. Two different types of clique
connectivity exist in this model: i) intra-layer clique connectivity between two nodes
within the same layer as shown by $e_T$, and ii) inter-layer clique connectivity between
nodes in two adjacent layers. The two end-nodes of an inter-layer clique are determined
based on the motion information.

To reduce computational complexity, the implementation of the DS-CRF model in
this work makes use of the last three frames as observations in the modeling of
the conditional probability:

$$P(Y_t \mid M_{t-2:t}, Y_{t-1}) = \frac{1}{Z(M_{t-2:t})} \prod_{c \in C}
\exp\Big( \sum_{k} \lambda_{k,c}\, f_{k,c}(y_{c,t}, M_{t-2:t}, Y_{t-1}) \Big). \tag{4}$$

The use of the last three frames is chosen as it provides sufficient information to
reasonably model accelerated motions given that both acceleration and velocity can be
computed, thus allowing for handling various motion situations in short time steps.
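
The reason three frames suffice can be sketched with finite differences: two consecutive position differences give two velocities, and their difference gives an acceleration. The snippet below is only an illustration under the assumption of a unit time step and a single tracked point per frame; the function names are hypothetical and are not part of the DS-CRF model itself.

```python
def velocity_and_acceleration(p0, p1, p2):
    """Finite-difference motion estimates from three (x, y) positions
    observed at times t-2, t-1 and t, assuming a unit time step."""
    v_prev = (p1[0] - p0[0], p1[1] - p0[1])  # velocity over [t-2, t-1]
    v_curr = (p2[0] - p1[0], p2[1] - p1[1])  # velocity over [t-1, t]
    accel = (v_curr[0] - v_prev[0], v_curr[1] - v_prev[1])
    return v_curr, accel

def predict_next(p2, v_curr, accel):
    # Constant-acceleration extrapolation of the position to time t+1.
    return (p2[0] + v_curr[0] + accel[0], p2[1] + v_curr[1] + accel[1])

# Points sampled from x(t) = 2t + 1.5t^2 (initial velocity 2, acceleration 3).
v, a = velocity_and_acceleration((0.0, 0.0), (3.5, 0.0), (10.0, 0.0))
p3 = predict_next((10.0, 0.0), v, a)
```

With only two frames the acceleration term is unobservable, so an accelerating target would be systematically under- or over-shot by a constant-velocity extrapolation.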

Motion-guided inter-layer connectivity 

As discussed above, the intra-layer connectivity between nodes in a layer incorporates
spatial context while the inter-layer connectivity between layers incorporates temporal
context into the DS-CRF model. The simplest approach to establishing inter-layer
connectivity between nodes from different layers would be to simply create inter-layer
cliques between nodes that represent the same spatial location at two different time
steps. This creates a simple regular spatial-temporal lattice that is fixed across time.
However, this is not appropriate for object silhouette tracking, as temporal neighbors
established under such a fixed inter-layer connectivity structure would not share
relevant information when target objects undergo drastic motion and shape
changes over time, and thus the feature functions under such a structure would hold
little meaning. Therefore, we are motivated to establish the inter-layer connectivity in
the proposed DS-CRF model in a dynamic and adaptive manner, where motion
information derived from the temporal observations is used to determine the inter-layer
cliques at each state layer.

In this work, we dynamically determine the inter-layer cliques of each node at each layer
$Y_t$ of the DS-CRF model based on the velocity obtained by inter-frame optical flow
computed from two consecutive temporal observations $M_t$ and $M_{t-1}$:

$$y_{c,t} = \{\, y_{i,t},\; y_{k,t-1} \,\}, \qquad i_x = k_x + V_x, \quad i_y = k_y + V_y, \tag{5}$$

where $y_{c,t}$ is an inter-layer clique at time $t$, $y_{i,t}$ is a node at time $t$ and $y_{k,t-1}$ is its
neighbor node at time $t-1$ based on the inter-layer clique connectivity, and $V_x$ and $V_y$
encode the velocities in the $x$ and $y$ directions, such that node $i$ in the $x$ direction (i.e., $i_x$)
is consistent with node $k$ (i.e., $k_x$) based on $V_x$, and in the same manner for the $y$ direction.
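
As a rough illustration of this motion-guided connectivity, the sketch below builds, for every node of layer Y_t, the index of its temporal neighbor in Y_{t-1} from a per-pixel flow field. It assumes the convention that the neighbor is found by stepping backwards along the flow, rounds the flow to whole pixels, and falls back to the same spatial location when the source would lie outside the frame; these choices and the name `temporal_neighbors` are assumptions of this example.

```python
def temporal_neighbors(vx, vy):
    """Map each node (row, col) of layer Y_t to its neighbour in Y_{t-1}.

    vx[r][c] and vy[r][c] hold the flow (in pixels) for the node at row r,
    column c. Out-of-frame sources fall back to the same location.
    """
    rows, cols = len(vx), len(vx[0])
    cliques = {}
    for r in range(rows):
        for c in range(cols):
            kr = r - int(round(vy[r][c]))  # step back along the y flow
            kc = c - int(round(vx[r][c]))  # step back along the x flow
            if not (0 <= kr < rows and 0 <= kc < cols):
                kr, kc = r, c
            cliques[(r, c)] = (kr, kc)
    return cliques

# A 3x3 field that is static except for one node moving one pixel to the right.
vx = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
vy = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
cliques = temporal_neighbors(vx, vy)
```

Static nodes link to the same spatial location in the previous layer, which reproduces the fixed lattice as a special case when the flow is zero everywhere.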

An illustrative example of the motion-guided inter-layer connectivity strategy is
shown in Fig. 4, where the inter-layer clique structures are established at $Y_t$ and $Y_{t-1}$
based on the inter-frame optical flow between temporal observations $M_t$ and $M_{t-1}$. It
can be seen that the nodes corresponding to the target object (indicated here as gray
nodes, e.g., $y_{i,t}$ and $y_{k,t}$) form inter-layer clique structures with nodes from the
previous state layer that characterize different spatial locations than them due to
motion, while nodes corresponding to the background (indicated by white nodes, e.g.,
$y_{j,t}$) form inter-layer cliques with nodes from the previous state layer corresponding to
the same spatial location since there is no motion at that position. As such, this
motion-guided dynamic inter-layer connectivity strategy allows for better
characterization of the temporal context of the object silhouette being tracked and allows the
feature functions to hold meaning.

Feature Functions 

In addition to the inter-layer connectivity between state layers, it is important to also 
describe the feature functions being incorporated into the proposed DS-CRF model. 
The set of feature functions is: i) optical flow, ii) target appearance, iii) spatial
coherency, and iv) edge.

Figure 4. Inter-layer connectivity. The inter-layer connectivity between nodes in
two adjacent layers is determined based on the motion information computed in two
consecutive frames. The inter-layer cliques are constructed dynamically and adaptively
based on motion information corresponding to each node in layer $Y_t$. As seen, the
corresponding temporal neighbor of node $y_{i,t}$ is $y_{k,t-1}$, where the inter-layer clique
structure is determined by use of inter-frame optical flow. The gray color corresponds to
nodes associated with the target object where there is movement in the previous frame.


Optical Flow This crucial feature function is described by the velocity of each pixel in
the $x$ and $y$ directions in two adjacent frames and is estimated via inter-frame optical
flow. Optical flow is an approximation of motion based upon local derivatives in a given
sequence of images [24]. It specifies the moving distance of each pixel in two adjacent
images:

$$I(x, y, t) = I(x + \delta x,\; y + \delta y,\; t + \delta t), \tag{6}$$

which, to first order, gives the constraint $I_x V_x + I_y V_y = -I_t$, where $(I_x, I_y)$
denotes the spatial intensity gradient, $I_t$ the temporal intensity derivative, and $V_x$ and
$V_y$ denote motion in both directions (here, in this implementation, $\delta t = 1$). Optical flow
assumes the change in a pixel's intensity corresponds to the displacement of pixels [13].
Here, inter-frame optical flow is applied between two temporal observations.

Target Appearance The model utilizes simple unary appearance feature functions
based on features describing the target object's appearance, including RGB color and
target appearance in the previous frame. To obtain this unary feature function, the label
state of time $t-1$ is shifted by the computed velocity to find the corresponding value
for each node:

$$f(y_{i,t}, M) = S(Y_{t-1}, V_x, V_y), \tag{7}$$

where $S(\cdot)$ shifts $Y_{t-1}$ based on velocities $V_x$ and $V_y$.
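
A minimal sketch of such a shift operator is given below. For simplicity it assumes a single integer velocity for the whole field (whereas the flow is computed per pixel), uses the label convention of the binary label field in this work (0 for silhouette, 1 for background), and fills pixels shifted in from outside the frame with background; the name `shift_labels` is hypothetical.

```python
BACKGROUND = 1  # label convention assumed here: 0 = silhouette, 1 = background

def shift_labels(prev_labels, vx, vy):
    """Shift a binary label field by an integer velocity (vx, vy).
    Pixels shifted in from outside the frame are marked as background."""
    rows, cols = len(prev_labels), len(prev_labels[0])
    shifted = [[BACKGROUND] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + vy, c + vx  # destination of pixel (r, c)
            if 0 <= nr < rows and 0 <= nc < cols:
                shifted[nr][nc] = prev_labels[r][c]
    return shifted

# One silhouette pixel at row 1, column 0 moves one pixel right and down.
prev = [[1, 1, 1],
        [0, 1, 1],
        [1, 1, 1]]
moved = shift_labels(prev, 1, 1)
```

The shifted field acts as a motion-compensated prior for the current layer: each node sees the label that the previous silhouette would have at its location if the motion continued unchanged.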


Spatial Coherency Each target in the scene has spatial color coherency. This term
reflects the relationship between neighboring nodes in the image. Each node belonging to a
target has strong relations with other nodes corresponding to the target silhouette. In
other words, the target appearance is coherent in each time frame. By adding this
feature function to the DS-CRF tracking framework, the proposed algorithm can track the
target object's silhouette despite large changes over time. A rough segmentation
algorithm [25] enforces label consistency among nodes within a segment produced by
the segmentation of frame $M_t$.
Edge The Ising model is the ordinary edge energy function utilized in different
problems, which we incorporate into the model as a spatial smoothness feature function:

$$f(y_{i,t}, y_{j,t}, M_t) = \mu(y_{i,t}, y_{j,t}, M_t), \tag{8}$$

where $\mu(\cdot)$ is the penalty function based on the similarity of two nodes $i$ and $j$.
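
For intuition, the classic Ising-style smoothness term can be sketched as a constant penalty for every pair of 4-connected neighbors whose labels disagree. This simplified version ignores the observation-dependent similarity that the penalty function may use in this work; the weight `beta` and the name `ising_energy` are assumptions of the example.

```python
def ising_energy(labels, beta=1.0):
    """Sum a penalty beta over every 4-connected neighbour pair whose
    labels disagree (a classic Ising-style smoothness term)."""
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            # Count each pair once: look only right and down.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += beta
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += beta
    return energy

smooth = ising_energy([[0, 0], [1, 1]])
checker = ising_energy([[0, 1], [1, 0]])
```

Labelings with short boundaries receive lower energy than fragmented ones, which is what encourages compact, smooth silhouettes.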

Training and Inference 

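Maximum likelihood is a common method to estimate the parameters of CRFs. As such,
the training of the proposed DS-CRF is done by maximizing the log-likelihood $\ell$ upon the
training data:

$$\ell(\lambda) = \sum_{c \in C} \sum_{k} \lambda_{k}\, f_{k,c}(y_{c,t}, M_{t-2:t}, Y_{t-1})
- \log Z(M_{t-2:t}). \tag{9}$$

Because the log-likelihood function $\ell(\lambda)$ is concave, the parameters $\lambda$ can be chosen
such that the global maximum is obtained and the gradient, or vector of partial
derivatives with respect to each parameter $\lambda_k$, becomes zero. Differentiating $\ell(\lambda)$ with
respect to the parameter $\lambda_k$, with the weight of the $k$-th feature function shared across
cliques, gives:

$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{c \in C} \Big( f_{k,c}(y_{c,t}, M_{t-2:t}, Y_{t-1})
- \sum_{Y'} f_{k,c}(y'_{c,t}, M_{t-2:t}, Y_{t-1})\, P(Y' \mid M_{t-2:t}, Y_{t-1}) \Big). \tag{10}$$

A closed-form solution does not exist; therefore, the parameters are determined iteratively
using gradient descent optimization. Our DS-CRF training is performed via the belief
propagation method [26].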
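
The structure of this gradient, observed feature values minus their expectation under the model, can be checked numerically on a toy problem. The sketch below uses exhaustive enumeration over a three-node chain instead of belief propagation, which is only viable for tiny graphs; the two features, the observations `OBS`, the training labels `TRUTH`, and the step size are all illustrative assumptions.

```python
import math
from itertools import product

OBS = [1, 1, 0]      # hypothetical thresholded observations
TRUTH = (1, 1, 1)    # hypothetical training labels
LABELINGS = list(product([0, 1], repeat=len(OBS)))

def features(Y):
    # F_1: nodes whose label agrees with the observation (unary cliques).
    # F_2: neighbouring pairs with equal labels (pairwise cliques).
    f1 = sum(1.0 for y, m in zip(Y, OBS) if y == m)
    f2 = sum(1.0 for a, b in zip(Y, Y[1:]) if a == b)
    return (f1, f2)

def score(lam, Y):
    return sum(l * f for l, f in zip(lam, features(Y)))

def log_likelihood(lam):
    log_Z = math.log(sum(math.exp(score(lam, Y)) for Y in LABELINGS))
    return score(lam, TRUTH) - log_Z

def gradient(lam):
    # d l / d lambda_k = F_k(training labels) - E_{Y' ~ P}[F_k(Y')]
    weights = [math.exp(score(lam, Y)) for Y in LABELINGS]
    Z = sum(weights)
    expected = [sum(w * features(Y)[k] for w, Y in zip(weights, LABELINGS)) / Z
                for k in range(len(lam))]
    return [features(TRUTH)[k] - expected[k] for k in range(len(lam))]

def fit(lam, step=0.1, iterations=200):
    # Plain gradient ascent on the concave log-likelihood.
    for _ in range(iterations):
        g = gradient(lam)
        lam = [l + step * gk for l, gk in zip(lam, g)]
    return lam

learned = fit([0.0, 0.0])
```

In a real model the expectation term is the expensive part, which is why approximate inference such as belief propagation is used in place of enumeration.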


After the training of the DS-CRF, inference is performed by evaluating the
probability of each random variable in the represented graph given the observations
and $Y_{t-1}$, while decoding is performed by assigning to the output variable $Y$ the
states with maximum probability:

$$P_{y_t} = P(Y_t = y_t \mid M_{t-2:t}, Y_{t-1}) \quad \forall\, y_t \in Y, \tag{11}$$

$$Y^* = \arg\max_{Y} P(Y_t \mid M_{t-2:t}, Y_{t-1}), \tag{12}$$

where Eq. (11) and Eq. (12) show the formal definition of the inference and decoding
process, respectively.


DS-CRF Tracking Framework 

Based on the DS-CRF model described above, we can then develop a state-based
framework for tracking object silhouettes across time in a video sequence as follows.
The first two frames are annotated by the user as initialization. The velocity is computed
based on these two frames, and the DS-CRF starts the tracking procedure at the third frame,
after which objects are tracked automatically. Optical flow is computed on the
last two seen frames at each time step. Since the optical flow is computed for each time
frame and the parameters are trained based on the velocity, the model needs to be trained only
once.

The DS-CRF essentially plays the role of fusing spatial context such as target object
shape and appearance with temporal context such as motion dynamics within the
proposed tracking framework. The contribution of each aspect of the spatial and
temporal information within the DS-CRF model is determined by the weights learned during
the training step. The inference process described in Eq. 11 is then performed based on
the learned DS-CRF.
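
The control flow of this procedure can be sketched as a simple loop. The two callables `estimate_flow` and `infer_labels` are hypothetical placeholders for the optical flow computation and the DS-CRF inference and decoding steps; only the ordering of operations is meant to be illustrative.

```python
def track(frames, init_labels, estimate_flow, infer_labels):
    """Run state-based tracking over a list of frames.

    frames:        observations M_1 .. M_T
    init_labels:   user annotations for the first two frames [Y_1, Y_2]
    estimate_flow: callable(prev_frame, frame) -> flow (hypothetical)
    infer_labels:  callable(last_three_frames, flow, prev_labels) -> labels
                   (hypothetical stand-in for DS-CRF inference and decoding)
    """
    labels = list(init_labels)
    for t in range(2, len(frames)):
        # Flow between the two most recent frames.
        flow = estimate_flow(frames[t - 1], frames[t])
        # Predict the current label field from the last three observations
        # and the previous label field.
        labels.append(infer_labels(frames[t - 2:t + 1], flow, labels[-1]))
    return labels
```

Since the trained weights are reused at every step, the loop contains no per-frame training; each new label field is fed back as the previous state for the next frame.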

To reduce the computational complexity of the proposed DS-CRF tracking
framework, the decoding result in each step (see Eq. 12) is utilized to obtain the object
silhouette. The decoding result consists of a binary label field $Y$ (i.e., each pixel has a
value $y \in \{0, 1\}$, where 0 indicates object silhouette pixels and 1 indicates background
pixels). An example of a temporal observation (i.e., video frame) and its corresponding
binary label field is shown in Fig. 5. The use of a binary label setup not only
reduces the computational complexity of the training process, but also improves its convergence.



Figure 5. An example of a temporal observation and the corresponding 
label field. The red oval indicates the target object being tracked. 

One issue that needs to be tackled when using a binary label setup is that, while 
well suited for single target object silhouette tracking, it is less appropriate for 
multi-target object silhouette tracking. To address this issue, we introduce a data 
association procedure where connected components in the binary label field are assigned 
to the target objects being tracked. This is accomplished by matching the object 
silhouettes determined for the previous time step to the connected components in the 
binary field at the current time step to determine the best template matches: 

$$T_j(t) = \arg\max_{c_i \in C} \mathcal{M}\big(c_i, T_j(t-1)\big), \tag{13}$$

where $T_j(t)$ is target $j$'s silhouette at time $t$, $C$ is the set of connected components
detected as targets, and $\mathcal{M}(\cdot)$ is a template matching function that evaluates the similarity
of two input silhouettes.
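
A minimal sketch of this data association step is shown below. It extracts 4-connected components of silhouette pixels (label 0) from the binary label field and, for each previously tracked silhouette, selects the component with the highest overlap. The Jaccard index is used here as a stand-in similarity measure; the actual template matching function is not specified in this section, and all function names are hypothetical.

```python
from collections import deque

def connected_components(field, target=0):
    """4-connected components of pixels equal to `target`, as sets of (r, c)."""
    rows, cols = len(field), len(field[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if field[r][c] != target or (r, c) in seen:
                continue
            comp, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and field[ny][nx] == target
                            and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(comp)
    return components

def overlap(a, b):
    # Jaccard index, used here as a stand-in similarity function.
    return len(a & b) / len(a | b)

def associate(field, previous_targets):
    """For each previous silhouette T_j(t-1), pick the best-matching component."""
    comps = connected_components(field)
    return [max(comps, key=lambda c: overlap(c, prev))
            for prev in previous_targets]

field = [[0, 0, 1, 1, 1, 1],
         [0, 0, 1, 1, 1, 1],
         [1, 1, 1, 1, 0, 0],
         [1, 1, 1, 1, 0, 0]]
prev_a = {(0, 1), (1, 1), (0, 2), (1, 2)}
prev_b = {(2, 3), (3, 3), (2, 4), (3, 4)}
matches = associate(field, [prev_a, prev_b])
```

An overlap-based measure only works when targets move by less than their own extent between frames; an appearance-based template match would be needed for faster motion.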


Results 

To evaluate the performance of the proposed DS-CRF model for the purpose of object 
silhouette tracking, a number of different experiments were performed to allow for a 
better understanding and analysis of the model under different conditions and factors. 
First, a set of experiments involving video of a simulated object with different motion 
dynamics is performed to study the capability of the DS-CRF model in handling objects 
with changing motion dynamics. Second, a set of experiments performed on videos of 
humans moving within a subway station from the PETS2006 database is used to study 
the capability of the DS-CRF model in handling object silhouette tracking scenarios 
where there is occlusion and objects that change drastically in shape and size over time. 

Experiment 1: Simulated object motion undergoing acceleration 

In this experiment, we examine the ability of the proposed DS-CRF method in tracking 
the silhouette of an object with different motion dynamics over time. To accomplish 
this, we produce three video sequences consisting of a simulated object undergoing the 
following motion dynamics over time: 
• Motion1: Object undergoes acceleration but remains constant in shape over time.

• Motion2: Object undergoes size change over time but moves at constant velocity.

• Motion3: Object undergoes acceleration as well as size change over time.

Sample frames from the video sequences Motion1, Motion2, and Motion3 are shown in
the first rows of Fig. 6(a), (b), and (c), respectively. The proposed method was then
used to predict the object silhouette over time based on these video sequences. The
predicted results are shown in the second rows of Fig. 6(a), (b), and (c), respectively. It
can be observed that the proposed method is able to provide accurate object silhouette
tracking results for all three video sequences, thus illustrating its ability to handle
uncertainties in both motion and object appearance over time.


Experiment 2: Real-life video of human targets 

In this experiment, we examine the capability of the DS-CRF model in handling object
silhouette tracking scenarios where there is occlusion and objects that change drastically
in shape and size over time. To accomplish this, we made use of three different video
sequences from the PETS2006 database depicting human targets moving within a
subway station (one of which has been used for evaluation in prior work), each used to
illustrate different aspects of the capability of the proposed method:

• Subway1: This sequence is used to illustrate the capability of the proposed
method in handling single object silhouette tracking over time. The object target
in this sequence is crossing the hallway from the top of the scene to the bottom of
the scene.

• Subway2: This sequence is used to illustrate the capability of the proposed method
in handling object occlusions. The object target in this sequence is crossing the
hallway from the right of the scene to the left of the scene, and becomes occluded
by a person walking from the left of the scene to the right of the scene.

• Subway3: This sequence is used to illustrate the capability of the proposed
method in handling multiple object silhouette tracking over time. Two of the
target objects in this sequence are crossing the hallway from the bottom of the
scene to the top of the scene, while a third object target is crossing the hallway
from the top of the scene to the bottom of the scene.


The PETS2006 database is a public dataset which is available from 
http://www.cvg.reading.ac.uk/PETS2006/data.html 

To provide a comparison for the performance of the proposed method, four different 
existing tracking methods are also evaluated: 


• Mean-shift tracking [27]: Mean-shift tracking is based on non-parametric feature 
space analysis, where the goal is to determine the maxima of a density function, 
which in the case of visual tracking is based on the color histogram of the target 
object. This goal is achieved via an iterative optimization strategy that locates 
the new target object position near the previous object position based on a 
similarity measure such as the Bhattacharyya distance. 


• Context tracking [28]: Context tracking is a discriminative tracking approach 
which utilizes a specifically trained detector in a semi-supervised fashion to locate the 
target in consecutive frames. The goal of this method is to locate all possible 
regions that look similar to the target. Context tracking then identifies and 
(a) Simulated Motion1: The first row shows the object motion, for which the motion 
information is {3, 2, 2, 2}. The second row shows the DS-CRF prediction result for 
each frame based on the last two observed frames. DS-CRF learns the accelerated 
motion properly. Each block represents 40 × 30 pixels. 

(b) Simulated Motion2: The object shape changes over time while the object moves 
with constant velocity. The DS-CRF prediction result for each frame, based on the 
last two observed frames, is shown in the second row. DS-CRF can learn object shape 
variation as well as object motion. Each block represents 20 × 20 pixels. 

(c) Simulated Motion3: The object motion is accelerated and the object shape 
changes over time. The motion information is {3, 2, 2, 2}. 


Figure 6. Simulated motion results. DS-CRF is examined with three 
different simulated motion and shape dynamics over time. In (a) the object undergoes 
acceleration but remains constant in shape over time, in (b) the object undergoes size 
change over time but moves at constant velocity, and in (c) the object undergoes 
acceleration as well as size change over time. 
Table 1. Quantitative results for different video sequences. The accuracy of the Visual 
Silhouette Tracker method is reported for one sequence only, since results for only one 
sequence have been provided by the authors of [15]. MST, CT, VST, and BPF refer to 
mean-shift tracking [27], context tracking [28], Visual Silhouette Tracker [15], and 
boosted particle filtering, respectively. 

Video Name    MST [27]    CT [28]    VST [15]    BPF     DS-CRF 
Subway1       88%         92%        100%        100%    100% 
Subway2       16%         42%        NA          100%    100% 



differentiates the target object from the ‘distracters’ within the set of possible 
regions based on a confidence measure derived from the posterior probability 
and supporting features. 


• Boosted particle filtering: Particle filtering is a discriminative tracking approach 
that approximates the posterior $P(Y_t \mid M_{0:t})$ with a Dirac measure using a finite 
set of $N$ particles $\{Y_t^i\}_{i=1,\ldots,N}$. The sample candidate particles are drawn 
based on the proposal distribution. The importance weight of each particle is then updated 
according to its previous weight and the importance function, which is often the 
transition prior. After that, the particles are re-sampled using their importance 
weights. Here, we employed the boosted particle filter proposed in [^, which 
incorporates mixture particle filtering [29] that is ideally suited to multi-target 
tracking. 


• Visual Silhouette Tracker [15]: The visual silhouette tracking method fuses 
different visual cues by means of conditional random fields. The object silhouette 
is estimated every frame according to visual cues including temporal color 
similarity, spatial color continuity, and spatial motion continuity. The incorporated 
energy functions are minimized within a conditional random field framework. 
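
The draw, weight, and re-sample cycle described in the particle filtering item above can 
be sketched as follows. This is a minimal, generic bootstrap particle filter for a 
one-dimensional state, offered only as an illustration; it is not the boosted particle 
filter used in the comparison. The function name particle_filter_step, the random-walk 
motion model, the Gaussian observation likelihood, and all numeric values are assumptions 
made for this example. 

```python
import math
import random


def particle_filter_step(particles, weights, observation, rng,
                         motion_std=1.0, obs_std=1.0):
    """One draw / weight / re-sample cycle for a 1-D state (illustrative only)."""
    # Draw: sample from the transition prior (assumed random-walk motion model).
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: previous weight times an assumed Gaussian observation likelihood.
    new_w = [w * math.exp(-0.5 * ((observation - p) / obs_std) ** 2)
             for p, w in zip(predicted, weights)]
    total = sum(new_w)
    n = len(predicted)
    if total == 0.0:
        new_w = [1.0 / n] * n  # all weights underflowed; fall back to uniform
    else:
        new_w = [w / total for w in new_w]
    # Re-sample: draw N particles with probability proportional to their weights.
    resampled = rng.choices(predicted, weights=new_w, k=n)
    return resampled, [1.0 / n] * n


rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
weights = [1.0 / 500] * 500
for z in [5.0, 5.5, 6.0]:  # hypothetical noisy position measurements
    particles, weights = particle_filter_step(particles, weights, z, rng)
# Weighted mean of the particle set as a point estimate of the state.
estimate = sum(p * w for p, w in zip(particles, weights))
```

After re-sampling, the weights are reset to uniform because the particle density itself 
now represents the posterior; the point estimate is then simply the particle mean. 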


Note that the mean-shift tracking and context tracking methods are only 
evaluated for the Subway1 and Subway2 sequences, as the implementations used were 
not designed for tracking multiple object targets within the same scene. The visual 
silhouette tracking method was only compared for the Subway1 sequence, as only the 
object silhouette results for that sequence were provided by the authors of [15]. Finally, 
the boosted particle filtering and proposed DS-CRF methods were evaluated for all three 
sequences (Subway1, Subway2, and Subway3). 

To compare methods quantitatively, the number of frames in which the tracker 
tracked the object correctly, divided by the total number of frames in the sequence, is 
reported as the accuracy: 

$$\text{Accuracy} = \frac{\text{Number of Correctly Tracked Frames}}{\text{Total Number of Frames}} \times 100. \tag{14}$$ 
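
As a worked illustration of Eq. (14), the following minimal sketch computes the accuracy 
from a list of per-frame flags. The function name tracking_accuracy is hypothetical, and 
how a frame is judged to be correctly tracked is not specified here, so the flags are 
simply assumed to be given. 

```python
def tracking_accuracy(frame_is_correct):
    """Percentage of frames in which the target was tracked correctly (Eq. 14)."""
    flags = list(frame_is_correct)
    if not flags:
        raise ValueError("sequence has no frames")
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


# Hypothetical example: 22 correctly tracked frames out of 25.
example_accuracy = tracking_accuracy([True] * 22 + [False] * 3)  # 88.0
```

For a sequence with several targets, the same function can be applied per target and the 
resulting values averaged. 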


Table 1 shows the quantitative results for the Subway1 and Subway2 sequences, while 
Table 2 presents the results for the Subway3 sequence. 

First, let us examine the performance of the proposed DS-CRF method in the 
situation where the object being tracked changes significantly in size and shape over 
time. Fig. 7 shows the single-object silhouette tracking results of the tested 
tracking methods for the Subway1 sequence. It can be observed that while the 
mean-shift tracking, context tracking, and boosted particle filtering methods lose the 
object target completely, both the visual silhouette tracking method and the proposed 
DS-CRF method are able to track the object silhouette all the way through. It can also 
be observed that the object silhouette obtained using the proposed method is more 
Table 2. Comparison results for the Subway3 sequence. The accuracy is reported as 
the average of the accuracy for all tracked targets since this dataset contains three 
targets. 

Video Name    Boosted Particle Filtering    DS-CRF 
Subway3       80%                           100% 


accurate than that obtained using the visual silhouette method. These results illustrate 
the capability of the proposed DS-CRF method in tracking the object silhouette over 
time in spite of drastic changes in size and shape over time. 




[Figure 7 panels: (a) Proposed Method; (b) Mean-shift Tracker [27]. Columns: initial 
frame, followed by later frames including Frame #190 and Frame #194.] 


Figure 7. Example tracking results for Subway1. It can be observed that while 
the mean-shift tracking, context tracking, and boosted particle filtering methods lose 
the object target completely, both the visual silhouette tracking method and the 
proposed DS-CRF method are able to track the object silhouette all the way through. It 
can also be observed that the object silhouette obtained using the proposed method is 
more accurate than that obtained using the visual silhouette method. 


Next, let us examine the performance of the proposed DS-CRF method in the 
situation where the object being tracked undergoes occlusion by other objects over time. 
Fig. 8 shows the single-object silhouette tracking results of the tested tracking 
methods for the Subway2 sequence. It can be observed that while the mean-shift 
tracking, context tracking, and boosted particle filtering methods lose the object target 
completely, the proposed DS-CRF method is able to track the object silhouette all the 
way through despite the target being occluded by another person. These results illustrate 
the capability of the proposed DS-CRF method in tracking the object silhouette over time 
in spite of object occlusion. 



[Figure 8 panels: (a) Proposed Method; (b) Mean-shift Tracker. Columns: initial frame, 
followed by two later frames.] 




Figure 8. Example tracking results for Subway2. It can be observed that while 
the mean-shift tracking, context tracking, and boosted particle filtering methods lose 
the object target completely, the proposed DS-CRF method is able to track the object 
silhouette all the way through despite being occluded by another person. 

Finally, let us examine the performance of the proposed DS-CRF method in the 
situation where we wish to track multiple object silhouettes over time. Fig. 9 shows the 
multiple-object silhouette tracking results of the tested tracking methods for the 
Subway3 sequence. It can be observed that while the boosted particle filtering method 
is able to track two of the three object targets completely, it loses one of the object 
targets as a result of it crossing paths with one of the other object targets. Furthermore, 
the boosted particle filtering method does not provide pixel-level object silhouettes and 
is only able to track bounding boxes. On the other hand, the proposed DS-CRF method 
is able to track all three of the object silhouettes at the pixel level all the way through. 
These results illustrate the capability of the proposed DS-CRF method in tracking 
multiple object silhouettes over time in a reliable manner. 
Figure 9. Example tracking results for Subway3. It can be observed that while 
the boosted particle filtering method is able to track two of the three object targets 
completely, it loses one of the object targets as a result of it crossing paths with one of 
the other object targets. Furthermore, the boosted particle filtering method does not 
provide pixel-level object silhouettes and is only able to track bounding boxes. On the 
other hand, the proposed DS-CRF method is able to track all three of the object 
silhouettes at the pixel level all the way through. 


Discussion 

Here, we proposed a deep-structured conditional random field (DS-CRF) model for 
object silhouette tracking. In this model, a series of state layers is used to characterize 
the object silhouette at all points in time within a video sequence. Connectivity 
between state layers, formed dynamically based on inter-frame optical flow, allows 
adjacent state layers to interact so that both spatial and temporal context can be 
utilized within a deep-structured probabilistic graphical model. 
Experimental results showed that the proposed DS-CRF model enables 
accurate and efficient pixel-level tracking of object silhouettes that can change 
greatly over time, as well as under different situations such as occlusion and multiple 
targets within the scene. Experimental results using both simulated data and real-world 
video datasets containing different scenarios demonstrated the capability of the 
proposed DS-CRF approach to provide strong object silhouette tracking performance 
when compared to existing tracking methods. 

One of the main contributing factors to the proposed method’s ability to handle 
uncertainties in object motion dynamics and size and shape changes over time is the 
way the inter-layer connectivity is established dynamically based on inter-frame optical 
flow information. If the inter-layer connectivity were established statically at all state 
layers of the deep-structured model, then the feature functions would hold little 
meaningful relationship in the temporal domain as the object accelerates and changes 
size over time. By making use of inter-frame optical flow information to determine 
inter-layer connectivity between adjacent state layers, the feature functions maintain 
meaning over time in guiding the prediction process. Another important contributing 
factor is the incorporation of object shape feature functions (spatial coherency), which 
forces the proposed method to consider object shape variations over time and also aids 
in the handling of changes in size and shape over time. 
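
The contrast between static and flow-driven inter-layer connectivity discussed above can 
be sketched as follows. This is a simplified illustration under stated assumptions, not 
the DS-CRF implementation: the function name propagate_silhouette, the representation of 
a silhouette as a set of pixel coordinates, and the per-pixel flow dictionary are all 
hypothetical. 

```python
def propagate_silhouette(prev_pixels, flow, height, width):
    """Map foreground nodes of layer t-1 to their linked nodes in layer t.

    prev_pixels: set of (y, x) foreground coordinates in layer t-1.
    flow: dict mapping (y, x) -> (dy, dx), the assumed displacement toward
          frame t; pixels missing from the dict are treated as stationary.
    """
    linked = set()
    for (y, x) in prev_pixels:
        dy, dx = flow.get((y, x), (0.0, 0.0))
        ty, tx = y + round(dy), x + round(dx)  # nearest grid node in layer t
        if 0 <= ty < height and 0 <= tx < width:  # drop links leaving the frame
            linked.add((ty, tx))
    return linked


prev = {(1, 1), (1, 2)}
flow = {p: (1.0, 2.0) for p in prev}  # hypothetical uniform motion

# Flow-driven links follow the object to its new position.
dynamic_links = propagate_silhouette(prev, flow, height=5, width=6)
# Static links (no flow) connect each node to the same grid position.
static_links = propagate_silhouette(prev, {}, height=5, width=6)
```

With static links, a moving object's nodes in the previous layer point at locations the 
object has already left, which is the loss of temporal meaning described above. 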

Future work involves extending the proposed DS-CRF model to incorporate not only 
inter-frame optical flow information, but also additional motion information via 
descriptor matching to better guide the establishment of inter-layer connectivity when 
objects undergo large displacements within a short time in the video sequence. 
Furthermore, we aim to explore the extension of the DS-CRF model with high-order 
and fully-connected clique structures [30] to improve modeling of spatial relationships 
for better object silhouette boundaries. Finally, we aim to explore the application of the 
proposed DS-CRF model for the purpose of improved video saliency detection using 
texture distinctiveness-based feature functions [31-33] and improved content-based 
video retargeting using energy gradient feature functions [34]. 